Section A

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As we can see from the histogram above, many properties sold at implausible values, such as 0 or 1 dollar. These sales could be part of a revitalization program in which property is sold cheaply to residents or reclaimed by the city, as suggested by grantors that are government entities like CITY OF DETROIT and sale terms like EXEMPT/GOVT, but there isn’t enough information to confirm this. The same is true of assessed values: what does it mean for a property to have a zero assessed value? Since even “arm’s length” transactions show this problem, there may be a data ingestion issue.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

According to the definitions of the various sale terms, however, only “arm’s length” sales are supposed to be representative of the market, so we will examine only those. For simplicity, let’s restrict the data to the lower half of the valuations for an initial inspection.
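A minimal sketch of this filtering step in base R, on toy data. The column name `sale_terms` and the label `"VALID ARMS LENGTH"` are assumptions for illustration, and "lower half" is read here as sale prices at or below the median, which is only one possible reading.

```r
# Toy data. `sale_terms` and the label "VALID ARMS LENGTH" are hypothetical;
# `sale_price` and `ASSESSEDVALUE` follow the column names in the model output.
sales <- data.frame(
  sale_terms    = c("VALID ARMS LENGTH", "EXEMPT/GOVT", "VALID ARMS LENGTH",
                    "VALID ARMS LENGTH", "VALID ARMS LENGTH"),
  sale_price    = c(45000, 1, 120000, 30000, 0),
  ASSESSEDVALUE = c(20000, 15000, 50000, 18000, 9000)
)

# Keep only arm's-length sales.
arms_length <- sales[sales$sale_terms == "VALID ARMS LENGTH", ]

# Restrict to the lower half: sale prices at or below the median.
cutoff     <- median(arms_length$sale_price)
lower_half <- arms_length[arms_length$sale_price <= cutoff, ]
```

Note that the zero-dollar sale survives this filter, which matches the observation that arm's-length transactions also contain strange values.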

Here we note a sharp difference between the shapes of the distributions of assessed values and actual sale values. Strangely, there also appears to be a sharp cutoff around \(\$20,000\) in the assessed values.

As the University of Chicago analysts note, the typical way to examine the fairness of assessments is through the sales ratio: the assessed value divided by the sale price.
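The ratio can be sketched in a few lines of base R. The data below are made up, and dropping non-positive sale prices before dividing is an assumption made here to avoid division by zero, not necessarily what the original analysis did.

```r
# Toy data using the column names that appear in the model output.
sales <- data.frame(
  sale_price    = c(40000, 80000, 25000, 0),
  ASSESSEDVALUE = c(20000, 30000, 20000, 5000)
)

# Drop non-positive prices so the ratio is well defined (assumed step).
valid <- sales[sales$sale_price > 0, ]

# Sales ratio: assessed value divided by sale price.
valid$sales_ratio <- valid$ASSESSEDVALUE / valid$sale_price

# The median is a common summary because it is robust to extreme ratios.
median_ratio <- median(valid$sales_ratio)
```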

Section B

## [1] "Filtered out non-arm's length transactions"
## [1] "Inflation adjusted to 2020"
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Looking at the trends over time, we see that, no matter how expensive the property is, the sales ratio and the assessed value decrease, while the actual sale price increases, especially from 2016 onwards.

Section C

## 
## Call:
## lm(formula = sale_price ~ property_c, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -579129  -23451  -10951    7049 5262049 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2628106.9    20296.1  -129.5   <2e-16 ***
## property_c      6648.5       50.6   131.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56820 on 544725 degrees of freedom
## Multiple R-squared:  0.03072,    Adjusted R-squared:  0.03072 
## F-statistic: 1.726e+04 on 1 and 544725 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = sale_price ~ ASSESSEDVALUE, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -834659  -19514   -8158    8410 5286288 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.316e+04  1.012e+02   130.0   <2e-16 ***
## ASSESSEDVALUE 1.383e+00  3.937e-03   351.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52110 on 544725 degrees of freedom
## Multiple R-squared:  0.1848, Adjusted R-squared:  0.1848 
## F-statistic: 1.235e+05 on 1 and 544725 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = sale_price ~ property_c + ASSESSEDVALUE + year, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -842647  -19172   -7772    8803 5286669 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.787e+06  4.465e+04 -62.425  < 2e-16 ***
## property_c     6.312e+03  4.562e+01 138.368  < 2e-16 ***
## ASSESSEDVALUE  1.373e+00  3.872e-03 354.597  < 2e-16 ***
## year           1.332e+02  2.017e+01   6.602 4.06e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51210 on 544723 degrees of freedom
## Multiple R-squared:  0.2125, Adjusted R-squared:  0.2125 
## F-statistic: 4.9e+04 on 3 and 544723 DF,  p-value: < 2.2e-16

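The regressions show that the variables we would intuitively associate with higher property values, such as assessed value and property class, are statistically significant predictors of sale price. The fit is modest, however: even the model with all three predictors has an R-squared of only 0.2125.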
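For readers who want to reproduce this kind of comparison, here is a hedged sketch that fits the same three formulas on simulated data and collects their R-squared values. The simulated coefficients and noise level are arbitrary assumptions, so the numbers will not match the output above.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the real data; values are illustrative only.
df <- data.frame(
  property_c    = sample(c(401, 402, 403), n, replace = TRUE),
  ASSESSEDVALUE = runif(n, 5000, 60000),
  year          = sample(2011:2020, n, replace = TRUE)
)
df$sale_price <- 10000 + 1.4 * df$ASSESSEDVALUE + rnorm(n, sd = 20000)

# The three model formulas from the output above.
fit_class <- lm(sale_price ~ property_c, data = df)
fit_av    <- lm(sale_price ~ ASSESSEDVALUE, data = df)
fit_full  <- lm(sale_price ~ property_c + ASSESSEDVALUE + year, data = df)

# Collect R-squared for a side-by-side comparison.
r2 <- sapply(
  list(class = fit_class, assessed = fit_av, full = fit_full),
  function(m) summary(m)$r.squared
)
```

Because the single-predictor models are nested in the full model, the full model's R-squared can never be lower than theirs; whether the gain is meaningful is a separate question.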

Section D

## 
## Call:
## glm(formula = foreclosed ~ property_c + ASSESSEDVALUE + year, 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6696  -0.1808  -0.1623  -0.1363   1.0852  
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   -8.414e-01  1.245e-01   -6.756 1.42e-11 ***
## property_c    -2.661e-03  2.329e-05 -114.243  < 2e-16 ***
## ASSESSEDVALUE  2.606e-06  1.325e-08  196.780  < 2e-16 ***
## year           1.015e-03  6.152e-05   16.495  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1392017)
## 
##     Null deviance: 451656  on 3188180  degrees of freedom
## Residual deviance: 443800  on 3188177  degrees of freedom
##   (7 observations deleted due to missingness)
## AIC: 2761117
## 
## Number of Fisher Scoring iterations: 2

With foreclosures, the same factors seem similarly predictive, but with an inverse relationship with property class: the lower the property class number, the more likely the property is to be foreclosed.
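The output above reports a gaussian family, which is what `glm()` uses when no `family` argument is given, so the fit is effectively a linear probability model. As a possible alternative, and not what the original analysis ran, a logistic regression can be sketched by passing `family = binomial`. The data below are simulated under assumed coefficients.

```r
set.seed(42)
n <- 500

# Simulated stand-in data; the true relationship here is an assumption.
df <- data.frame(
  ASSESSEDVALUE = runif(n, 5000, 60000),
  year          = sample(2011:2020, n, replace = TRUE)
)
p <- plogis(-2 + 0.00003 * df$ASSESSEDVALUE)
df$foreclosed <- rbinom(n, 1, p)

# Logistic regression: models the log-odds of foreclosure.
fit_logit <- glm(foreclosed ~ ASSESSEDVALUE + year,
                 data = df, family = binomial)

# Predicted probabilities always fall strictly between 0 and 1.
pred <- predict(fit_logit, type = "response")
```

Unlike the gaussian fit, the coefficients here are on the log-odds scale, so they are not directly comparable to the estimates printed above.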

## 
## Call:
## glm(formula = foreclosed ~ property_c + ASSESSEDVALUE + year, 
##     data = df %>% mutate(property_c = as.factor(property_c)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5725  -0.1959  -0.1816  -0.0255   1.0252  
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   -1.063e+00  1.233e-01   -8.622   <2e-16 ***
## property_c402 -1.399e-01  6.749e-04 -207.209   <2e-16 ***
## property_c403 -4.991e-02  1.468e-03  -33.998   <2e-16 ***
## property_c404  1.083e-01  8.469e-03   12.782   <2e-16 ***
## property_c446 -1.426e-01  1.398e-02  -10.200   <2e-16 ***
## property_c447 -1.941e-01  1.708e-02  -11.366   <2e-16 ***
## property_c448 -1.862e-01  4.200e-03  -44.334   <2e-16 ***
## property_c461 -1.704e-01  1.518e-03 -112.280   <2e-16 ***
## property_c465 -1.658e-01  5.050e-03  -32.839   <2e-16 ***
## property_c483  5.195e-02  2.180e-02    2.383   0.0172 *  
## ASSESSEDVALUE  1.961e-06  1.373e-08  142.847   <2e-16 ***
## year           6.085e-04  6.114e-05    9.951   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1373675)
## 
##     Null deviance: 451656  on 3188180  degrees of freedom
## Residual deviance: 437951  on 3188169  degrees of freedom
##   (7 observations deleted due to missingness)
## AIC: 2718835
## 
## Number of Fisher Scoring iterations: 2

Previously we assumed that property class could be treated as an ordinal variable. If we instead treat it properly as a categorical variable, we see that two classes in particular are associated with foreclosure: converted residences and residential renter zones.
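The difference between the two codings can be illustrated on simulated data. In this sketch the class codes, the foreclosure rates, and the helper column `class_f` are all assumptions; only `property_c` and `foreclosed` come from the output above.

```r
set.seed(7)
n <- 300

# Simulated data in which one class (404) has a higher foreclosure rate.
df <- data.frame(property_c = sample(c(401, 402, 404), n, replace = TRUE))
df$foreclosed <- rbinom(n, 1, ifelse(df$property_c == 404, 0.4, 0.1))

# Numeric coding: a single slope, which assumes the effect is linear
# in the class number.
fit_num <- glm(foreclosed ~ property_c, data = df)

# Categorical coding: one coefficient per class, each measured
# relative to the first factor level (here 401).
df$class_f <- factor(df$property_c)  # hypothetical helper column
fit_cat <- glm(foreclosed ~ class_f, data = df)

coef_names_num <- names(coef(fit_num))
coef_names_cat <- names(coef(fit_cat))
```

With the categorical coding, each estimate is a difference from the reference level, so a positive coefficient means a higher fitted foreclosure rate than that reference class.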